We are going to perform exploratory data analysis (EDA) on the data from the AMS 2013-2014 Solar Energy Prediction Contest using the R programming language. The goal of the contest is to forecast daily solar energy with an ensemble of weather models.
Data source: https://www.kaggle.com/competitions/ams-2014-solar-energy-prediction-contest
library(readr) # Data reader
library(dplyr) # A grammar of data manipulation
library(tibble) # Modern take on data frames.
library(dlookr) # Tools for Data Diagnosis, Exploration, and Transformation
library(DataExplorer) # Automate Data Exploration and Treatment
library(skimr) # Useful summary statistics
library(lubridate) # wday() function to get the day of the week
library(ggplot2) # Data visualization
library(plotly) # Interactive maps (plot_geo, toRGB)
File with name, latitude, longitude, and elevation of each of the 98 stations.
station_info <- read_csv('~/R_Programming/station_info.csv')
The real values of solar production recorded in 98 different weather stations ranging from 1994-01-01 to 2012-11-30
solar_dataset <- readRDS(file = '~/R_Programming/solar_dataset.RData')
The 100 original variables identified as most important for predicting the values of the first station (column 2, ACME) after feature importance analysis.
additional_variables <- readRDS(file = '~/R_Programming/additional_variables.RData')
We are going to perform EDA on solar_dataset, which is the main dataset.
It has a total dimension of 6909 rows and 456 columns.
dim(solar_dataset)
glimpse(solar_dataset)
Next, we start the data cleaning process. Let’s look for missing values using the DataExplorer package.
plot_missing(solar_dataset, missing_only = TRUE)
options(repr.plot.width = 8, repr.plot.height = 3)
plot_missing(solar_dataset)
Looking at the size of the dataset and the missing value plot, it seems as if we can remove the missing values and still have a good-sized set of data to work on, so let’s start by doing that.
solar_dataset <- na.omit(solar_dataset)
dim(solar_dataset)
New dimensions: 5113 rows and 456 columns.
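Before dropping rows, it can help to count the missing values per column and check how many rows would survive. The following is a minimal sketch on a made-up toy data frame (the values are illustrative and do not come from solar_dataset), using only base R:

```r
# Toy data frame standing in for solar_dataset (values are made up)
toy <- data.frame(
  Date = c("19940101", "19940102", "19940103", "19940104"),
  ACME = c(12384900, NA, 11930700, NA),
  ADAX = c(11908500, 9778500, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows without any missing value, i.e. the rows na.omit() keeps
rows_kept <- sum(complete.cases(toy))
rows_kept

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)
dim(toy_clean)
```

Running the same two checks on the real dataset shows how much data is lost before committing to na.omit().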
Let’s convert the Date column from character to the Date class. It will be useful for our EDA.
solar_dataset$Date <- as.Date(solar_dataset$Date, format = "%Y%m%d")
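A quick way to check a date format string is to try it on a few values first. This is a minimal sketch that assumes the Date column stores strings in a yyyymmdd layout such as "19940101"; the sample values are illustrative only.

```r
# Hypothetical sample values in the assumed yyyymmdd layout
raw_dates <- c("19940101", "20071231", "20121130")

# %Y = four-digit year, %m = month, %d = day of the month
parsed <- as.Date(raw_dates, format = "%Y%m%d")

class(parsed)
format(parsed, "%Y-%m-%d")

# A format string that does not match the input gives NA rather than an error
# (here the input has no "-" separators)
mismatch <- as.Date("19940101", format = "%Y-%m-%d")
is.na(mismatch)
```

Checking for unexpected NA values after the conversion, for example with sum(is.na(solar_dataset$Date)), is a cheap way to catch a wrong format string.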
Then, we create new columns for year, month, and day using tidyr::separate.
solar_dataset_dt <- tidyr::separate(solar_dataset, Date, c('year', 'month', 'day'), sep = "-",remove = FALSE)
We will also create a new variable that tells us the day of the week, using the wday function from the lubridate package.
solar_dataset_dt$dayOfWeek <- wday(solar_dataset_dt$Date, label=TRUE)
Because the principal components won’t be used in this study, we will exclude all PC columns and keep only the real values of solar production recorded at the 98 weather stations.
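# After separate() and wday(), the columns are: Date, year, month, day (1-4),
# the 98 stations (5-102), the PC columns (103-459), and dayOfWeek (460)
sub_solar_dataset <- subset(solar_dataset_dt, select = -c(103:459))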
dim(sub_solar_dataset)
head(sub_solar_dataset)
New dimensions: 5113 rows and 103 columns.
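Dropping columns by position is fragile, because positions shift whenever columns are added. Below is a minimal sketch of a name-based alternative in base R. It assumes that the principal component columns share a "PC" prefix followed by a number (PC1, PC2, ...); that naming, the ADAX column, and the toy values are assumptions for illustration.

```r
# Toy data frame with two station columns and two assumed PC columns
toy <- data.frame(
  Date = as.Date(c("1994-01-01", "1994-01-02")),
  ACME = c(12384900, 11930700),
  ADAX = c(11908500, 9778500),
  PC1  = c(0.5, -1.2),
  PC2  = c(1.1, 0.3)
)

# TRUE for columns whose name is "PC" followed only by digits
is_pc <- grepl("^PC[0-9]+$", names(toy))

# Keep every column that is not a principal component
toy_no_pc <- toy[, !is_pc, drop = FALSE]
names(toy_no_pc)
```

The anchored pattern avoids accidentally dropping a station whose name merely starts with the letters "PC".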
Let’s check the data size and structure.
glimpse(sub_solar_dataset)
To review useful summary statistics, we use the skimr package.
skim(sub_solar_dataset)
Let’s look for outliers.
diagnose_outlier(sub_solar_dataset)
Fortunately, no outliers were found.
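For intuition about what an outlier check flags, here is a minimal sketch of the common boxplot rule (values beyond 1.5 times the interquartile range from the hinges) using base R's boxplot.stats(). The vector is made up, and this is only an illustration of the rule, not a description of how any particular package implements its diagnosis.

```r
# Made-up readings with one clearly extreme value
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 100)

# boxplot.stats() returns the values lying outside the whiskers in $out
outliers <- boxplot.stats(x)$out
outliers

# Share of observations flagged as outliers
outlier_ratio <- length(outliers) / length(x)
outlier_ratio
```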
See the outlier diagnosis plot
sub_solar_dataset %>% plot_outlier(BOIS)
# Change the station name in parentheses to see the plots for other stations
Use diagnose_web_report to check the data. It automatically generates a report covering the previous data diagnosis steps.
diagnose_web_report(sub_solar_dataset)
Let’s review the descriptive statistics of our dataset.
See the statistics summary
summary(sub_solar_dataset)
See descriptive statistics
describe(sub_solar_dataset)
Check normality
normality(sub_solar_dataset)
If the p-value is less than or equal to the significance level alpha, we reject the hypothesis that the data are normally distributed. Fortunately, we are good here.
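To make the decision rule concrete, here is a minimal sketch using base R's shapiro.test() on simulated data. The simulated samples and the choice of alpha = 0.05 are assumptions for illustration only.

```r
set.seed(42)
alpha <- 0.05  # assumed significance level

# One roughly bell-shaped sample and one strongly right-skewed sample
normal_sample <- rnorm(500, mean = 15e6, sd = 5e6)
skewed_sample <- rexp(500, rate = 1)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

# TRUE means normality is rejected at the chosen alpha
reject_normal <- p_normal <= alpha
reject_skewed <- p_skewed <= alpha
c(normal = reject_normal, skewed = reject_skewed)
```

Note that shapiro.test() accepts between 3 and 5000 observations, so larger columns need to be sampled before testing.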
Plot normality
sub_solar_dataset %>% plot_normality(BOIS)
Plot density
plot_density(sub_solar_dataset)
Next, we compute the correlation matrix. This part could be improved by making use of the date-derived columns (day, month, and year).
correlate(sub_solar_dataset)
Plot correlation matrix
plot_correlate(sub_solar_dataset)
It seems like we have too many features, so the plot isn’t readable.
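One way to get a readable result is to restrict the correlation matrix to a handful of stations. This is a minimal sketch with base R's cor() on simulated data; ACME and BOIS are station columns mentioned above, STN3 is a hypothetical extra column, and all values are simulated.

```r
set.seed(1)
n <- 200
shared_signal <- rnorm(n)  # common component, e.g. regional weather

toy <- data.frame(
  ACME = shared_signal + rnorm(n, sd = 0.3),
  BOIS = shared_signal + rnorm(n, sd = 0.3),
  STN3 = rnorm(n)  # hypothetical station unrelated to the other two
)

# Pick only the stations of interest before correlating
stations <- c("ACME", "BOIS")
cor_subset <- cor(toy[, stations], use = "complete.obs")
round(cor_subset, 2)
```

On the real data, the same subset of columns can be passed to the correlation plot instead of the full data frame.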
Create an EDA report with a single line of code
eda_web_report(sub_solar_dataset)
Let’s plot the location of each solar station, colored by its elevation.
g <- list(
  scope = "usa",
  projection = list(type = "albers usa", scale = 1),
  showland = TRUE,
  landcolor = toRGB("gray95"),
  subunitcolor = toRGB("gray85"),
  countrycolor = toRGB("gray85"),
  countrywidth = 0.5,
  subunitwidth = 0.5
)
fig <- plot_geo(station_info, lat = ~nlat, lon = ~elon)
fig <- fig %>% add_markers(
  text = ~paste(stid, paste("Elevation (m):", elev), sep = "<br />"),
  color = ~elev, symbol = I("circle"), size = I(20), hoverinfo = "text"
)
fig <- fig %>% colorbar(title = "Elevation (m)")
fig <- fig %>% layout(title = "Station info.", geo = g)
fig
We see that solar_dataset has too many features (stations). This is a big challenge when performing EDA in R, because the data needs proper preparation before it is easy to visualize. I believe that going further and plotting the day of the week, month, and year against the sum of solar energy for each station will give us more insight into this dataset.
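As a starting point for that follow-up, here is a minimal sketch that totals production per month for a single station with base R's aggregate(). The toy values are made up; the column names month and ACME mirror the ones used above.

```r
# Toy stand-in for the prepared dataset (values are made up)
toy <- data.frame(
  Date = as.Date(c("1994-01-01", "1994-01-02", "1994-02-01", "1994-02-02")),
  ACME = c(10, 20, 30, 40)
)

# Derive the month from the Date column as a two-digit string
toy$month <- format(toy$Date, "%m")

# Sum of solar energy per month for station ACME
monthly_total <- aggregate(ACME ~ month, data = toy, FUN = sum)
monthly_total
```

The same pattern works with year or dayOfWeek as the grouping column, and the resulting small table is straightforward to pass to ggplot2.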